A Probabilistic Corpus-Driven Model for Lexical-Functional Analysis

Authors

  • Rens Bod
  • Ronald M. Kaplan
Abstract

1. Introduction

Data-Oriented Parsing (DOP) models of natural language embody the assumption that human language perception and production work with representations of past language experiences, rather than with abstract grammar rules (cf. Bod 1992, 95; Scha 1992; Sima'an 1995; Rajman 1995). DOP models therefore maintain large corpora of linguistic representations of previously occurring utterances. New utterances are analyzed by combining (arbitrarily large) fragments from the corpus; the occurrence-frequencies of the fragments are used to determine which analysis is the most probable one.

In accordance with the general DOP architecture outlined by Bod (1995), a particular DOP model is described by specifying settings for the following four parameters:

  • a formal definition of a well-formed representation for utterance analyses,
  • a set of decomposition operations that divide a given utterance analysis into a set of fragments,
  • a set of composition operations by which such fragments may be recombined to derive an analysis of a new utterance, and
  • a definition of a probability model that indicates how the probability of a new utterance analysis is computed on the basis of the probabilities of the fragments that combine to make it up.

Previous instantiations of the DOP architecture were based on utterance analyses represented as surface phrase-structure trees ("Tree-DOP", e.g. Bod 1993; Rajman 1995; Sima'an 1995; Goodman 1996; Bonnema et al. 1997).
Tree-DOP uses two decomposition operations that produce connected subtrees of utterance representations: (1) the Root operation selects any node of a tree to be the root of the new subtree and erases all nodes except the selected node and the nodes it dominates; (2) the Frontier operation then chooses a set (possibly empty) of nodes in the new subtree different from its root and erases all subtrees dominated by the chosen nodes. The only composition operation used by Tree-DOP is a node-substitution operation that replaces the leftmost nonterminal frontier node in a subtree with a fragment whose root category matches the category of the frontier node. Thus Tree-DOP provides tree representations for new utterances by combining fragments from a corpus of phrase-structure trees.

A Tree-DOP representation $R$ can typically be derived in many different ways. If each derivation $D$ has a probability $P(D)$, then the probability of deriving $R$ is the sum of the individual derivation probabilities:

$$P(R) = \sum_{D \text{ derives } R} P(D)$$

A Tree-DOP derivation $D = \langle t_1, t_2, \ldots, t_k \rangle$ is produced by a stochastic branching process. It starts by randomly choosing a fragment $t_1$ labeled with the initial category (e.g. S). At each subsequent step, a next fragment is chosen at random from among the set of competitors for composition into the current subtree. The process stops when a tree results with no nonterminal leaves. Let $CP(t \mid CS)$ denote the probability of choosing a tree $t$ from a competition set $CS$ containing $t$. Then the probability of a derivation is

$$P(\langle t_1, t_2, \ldots, t_k \rangle) = \prod_i CP(t_i \mid CS_i)$$

where the competition probability $CP(t \mid CS)$ is given by

$$CP(t \mid CS) = \frac{P(t)}{\sum_{t' \in CS} P(t')}$$

Here, $P(t)$ is the fragment probability for $t$ in a given corpus. Let $T_{i-1} = t_1 \circ t_2 \circ \cdots \circ t_{i-1}$ be the subanalysis just before the $i$th step of the process, let $LNC(T_{i-1})$ denote the category of the leftmost nonterminal of $T_{i-1}$, and let $r(t)$ denote the root category of a fragment $t$.
Then the competition set at the $i$th step is

$$CS_i = \{\, t : r(t) = LNC(T_{i-1}) \,\}$$

That is, the competition sets for Tree-DOP are determined by the category of the leftmost nonterminal of the current subanalysis. This is not the only possible definition of competition set. As Manning and Carpenter (1997) have shown, the competition sets can be made dependent on the composition operation. Their left-corner language model would also apply to Tree-DOP, yielding a different definition for the competition sets. But the properties of such Tree-DOP models have not been investigated.

Experiments with Tree-DOP on the Penn Treebank and the OVIS corpus show a consistent increase in parse accuracy when larger and more complex subtrees are taken into account (cf. Bod 1993, 95, 98; Bonnema et al. 1997; Sekine & Grishman 1995; Sima'an 1995). However, Tree-DOP is limited in that it cannot account for underlying syntactic (and semantic) dependencies that are not
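
The Tree-DOP definitions given above (Root and Frontier decomposition, competition sets keyed on root category, and the sum over derivations) can be illustrated with a small Python sketch. It is not the authors' implementation. The tuple encoding of trees, the function names, and the two-tree toy corpus are all hypothetical, and the fragment probability P(t) is assumed here to be the relative frequency of t in the bag of corpus fragments, which the text above does not specify. Because the competition set depends only on the root category and substitution is always leftmost, each way of cutting a tree into fragments corresponds to exactly one derivation, so the sum over derivations can be computed recursively.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Assumed encoding: a tree is a tuple (label, child, ...); a word is a string.
# A nonterminal frontier node is a one-element tuple such as ("NP",).

def rooted_fragments(node):
    """Fragments rooted at `node`, each paired with the subtrees Frontier erased."""
    options = []
    for child in node[1:]:
        if isinstance(child, str):
            options.append([(child, [])])        # words are always kept
        else:
            cut = ((child[0],), [child])         # Frontier: erase everything below child
            options.append([cut] + rooted_fragments(child))
    result = []
    for combo in product(*options):
        fragment = (node[0], *(part for part, _ in combo))
        erased = [sub for _, subs in combo for sub in subs]
        result.append((fragment, erased))
    return result

def nonterminal_nodes(node):
    """Yield every nonterminal node of a tree (the candidates for Root)."""
    yield node
    for child in node[1:]:
        if not isinstance(child, str):
            yield from nonterminal_nodes(child)

def fragment_counts(corpus):
    """Count every fragment produced by Root followed by Frontier."""
    counts = Counter()
    for tree in corpus:
        for node in nonterminal_nodes(tree):
            for fragment, _ in rooted_fragments(node):
                counts[fragment] += 1
    return counts

def make_cp(counts):
    """Return CP(t | CS), where CS is the set of fragments sharing t's root."""
    root_totals = Counter()
    for fragment, n in counts.items():
        root_totals[fragment[0]] += n
    def cp(fragment):
        total = root_totals[fragment[0]]
        return Fraction(counts[fragment], total) if total else Fraction(0)
    return cp

def tree_probability(tree, cp):
    """P(R): sum over all derivations of `tree` of the product of CP values."""
    total = Fraction(0)
    for fragment, erased in rooted_fragments(tree):
        p = cp(fragment)
        for subtree in erased:                   # each erased subtree must be re-derived
            p *= tree_probability(subtree, cp)
        total += p
    return total

# Hypothetical two-tree corpus, for illustration only.
corpus = [
    ("S", ("NP", "she"), ("VP", ("V", "sings"))),
    ("S", ("NP", "he"), ("VP", ("V", "sings"))),
]
cp = make_cp(fragment_counts(corpus))
print(tree_probability(corpus[0], cp))  # prints 1/2
```

Exact fractions are used so that the probabilities of the two corpus trees visibly sum to one. The recursion is exponential in tree size and is meant only to make the definitions concrete, not to be an efficient parser.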


Related resources

A Probabilistic Corpus-Driven Approach to Lexical-Functional Representations

The Data-Oriented Parsing (DOP) method suggested by Scha (1990) and developed in Bod (1992–1995) is a probabilistic language processing model which does not single out a narrowly predefined set of structures as the statistically significant ones. It accomplishes this by maintaining a large corpus of analyses of previously occurring utterances. New input is analyzed by combining partial analyses...


Comparing Lexical Bundles in Hard Science Lectures; A Case of Native and Non-Native University Lecturers

Researchers stated that learning and applying certain set of lexical bundles of native lecturers by non-native lecturers would help students improve their proficiency through incidental vocabulary input. The present study shed light on the lexical bundles in hard science lectures used by Native and Non-native lecturers in international universities with the main purpose of analyzing the structu...


The Use of Lexical Bundles in Native and Non-native Post-graduate Writing: The Case of Applied Linguistics MA Theses

Connor et al. (2008) mention “specifying textual requirements of genres” (p.12) as one of the reasons which have motivated researchers in the analysis of writing. Members of each genre should be able to produce and retrieve these textual requirements appropriately to be considered communicatively proficient. One of the textual requirements of genres is regularities of specific forms and content...


Examining the Effect of Ideology and Idiosyncrasy on Lexical Choices in Translation Studies within the CDA Framework

Using a critical discourse analytic model of translation criticism, the present study attempts to explore the effect of ideology and idiosyncrasy on the lexical choices in translation studies. The study employed a descriptive approach to answer two research questions: Is there any relationship between ideology and idiosyncratic features of translators' lexical choices? And if yes, can it be ana...


ACADEMIC WRITING REVISITED: A PHRASEOLOGICAL ANALYSIS OF APPLIED LINGUISTICS HIGH-STAKE GENRES FROM THE PERSPECTIVE OF LEXICAL BUNDLES

Lexical bundles are frequent word combinations that commonly appear in different registers. They have been the subject of much research in the area of corpus linguistics during the last decade. While most previous studies of bundles have mainly focused on variations in the use of these word combinations across different registers and a number of disciplines, not much research has been done to e...


Part-of-Speech Tagging of Persian Using a Fuzzy Network Model

Part of speech tagging (POS tagging) is an ongoing research in natural language processing (NLP) applications. The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The purpose of POS tagging is determining the grammatical ...




Publication date: 1998